Data Description


The data contains 61,069 samples of hypothetical mushrooms based on 173 species (353 mushrooms per species). Each observation represents a mushroom described by 21 different attributes.


The problem at hand is a binary classification problem, with ‘class’ as our response variable. The initial dataset of 21 variables is condensed to 15 by removing redundant ones; for example, ‘cap-color’ and ‘gill-color’ conveyed essentially the same color information.


Our binary response class is divided into ‘edible’ and ‘poisonous’, with ‘0’ representing edible and ‘1’ as poisonous.


The structure of the dataset is given below, where ‘n’ and ‘m’ are nominal and metrical variables respectively.


1. cap-diameter (m): float number in cm
2. cap-shape (n): bell=b, conical=c, convex=x, flat=f, sunken=s, spherical=p, others=o
3. cap-surface (n): fibrous=i, grooves=g, scaly=y, smooth=s, shiny=h, leathery=l, silky=k, sticky=t, wrinkled=w, fleshy=e
4. cap-color (n): brown=n, buff=b, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y, blue=l, orange=o, black=k
5. does-bruise-bleed (n): bruises-or-bleeding=t,no=f
6. gill-attachment (n): adnate=a, adnexed=x, decurrent=d, free=e, sinuate=s, pores=p, none=f, unknown=?
7. gill-spacing (n): close=c, distant=d, none=f
8. gill-color (n): see cap-color + none=f
9. stem-height (m): float number in cm
10. stem-width (m): float number in mm
11. stem-root (n): bulbous=b, swollen=s, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r
12. stem-surface (n): see cap-surface + none=f
13. stem-color (n): see cap-color + none=f
14. veil-type (n): partial=p, universal=u
15. veil-color (n): see cap-color + none=f
16. has-ring (n): ring=t, none=f
17. ring-type (n): cobwebby=c, evanescent=e, flaring=r, grooved=g, large=l, pendant=p, sheathing=s, zone=z, scaly=y, movable=m, none=f, unknown=?
18. spore-print-color (n): see cap-color
19. habitat (n): grasses=g, leaves=l, meadows=m, paths=p, heaths=h, urban=u, waste=w, woods=d
20. season (n): spring=s, summer=u, autumn=a, winter=w

## Classes 'data.table' and 'data.frame':   37065 obs. of  15 variables:
##  $ class               : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
##  $ cap.diameter        : num  1.72 0.86 2.8 1.18 2.46 ...
##  $ cap.shape           : Factor w/ 7 levels "b","c","f","o",..: 7 7 7 6 1 7 7 7 7 7 ...
##  $ cap.surface         : Factor w/ 11 levels "d","e","g","h",..: 8 3 8 8 1 6 8 9 9 9 ...
##  $ cap.color           : Factor w/ 12 levels "b","e","g","k",..: 12 8 11 12 8 11 11 12 6 6 ...
##  $ does.bruise.or.bleed: Factor w/ 2 levels "f","t": 1 1 1 1 1 2 2 2 2 1 ...
##  $ gill.attachment     : Factor w/ 7 levels "a","d","e","f",..: 2 1 2 4 1 2 3 7 5 6 ...
##  $ gill.color          : Factor w/ 12 levels "b","e","f","g",..: 11 8 11 3 9 11 4 12 12 8 ...
##  $ stem.height         : num  7.01 4.25 3.13 3.39 5.03 ...
##  $ stem.width          : num  2.36 1.05 5.69 5.6 6.05 ...
##  $ stem.color          : Factor w/ 12 levels "e","f","g","k",..: 12 4 11 12 9 11 11 11 6 11 ...
##  $ has.ring            : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 1 2 2 ...
##  $ ring.type           : Factor w/ 7 levels "e","f","g","l",..: 2 2 2 2 2 2 6 2 5 7 ...
##  $ habitat             : Factor w/ 7 levels "d","g","h","l",..: 3 1 5 1 2 1 5 1 1 1 ...
##  $ season              : Factor w/ 4 levels "a","s","u","w": 3 2 1 3 1 3 1 3 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:24004] 2 4 12 13 16 18 21 22 27 31 ...
##   ..- attr(*, "names")= chr [1:24004] "2" "4" "12" "13" ...
##  - attr(*, ".internal.selfref")=<externalptr>


The distribution of our response class is as follows.

## 
##     0     1 
## 16944 20121

Exploratory Data Analysis


An inference from the habitat bar plot is that mushrooms found in waste areas are mostly edible, though that alone is certainly no reason to eat them!
Contrary to what I expected, most of the mushrooms found in urban areas appear to be poisonous.


The plot below tells us that most of the samples in our data are mushrooms with convex or sunken cap shapes, and a mushroom with either of these two shapes has a roughly equal chance of being edible or poisonous.


Rings are skirt-shaped structures on the stem, usually just below the gills. From the plot below, it looks like mushrooms without a ring are more often poisonous.



Let’s split our data into training and testing sets before building any models. With an 80/20 stratified split, our training set has 29,653 observations while the testing set has 7,412. The class proportions in the training and testing sets are given below.

## 
##     0     1 
## 13556 16097
## 
##    0    1 
## 3388 4024
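The report uses caret's `createDataPartition` for this stratified split. As a rough base-R sketch of the same idea (using a toy 0/1 factor `cls` as a stand-in for the real `class` column), sampling 80% of the row indices within each class keeps the class proportions equal across the two sets:

```r
set.seed(111)
# toy stand-in for the real response: a 0/1 factor in similar proportions
cls <- factor(sample(c(0, 1), 1000, replace = TRUE, prob = c(0.46, 0.54)))

# stratified 80/20 split: sample 80% of the indices within each class,
# which is what caret::createDataPartition does for us in the report
train_idx <- sort(unlist(lapply(split(seq_along(cls), cls),
                                function(i) sample(i, floor(0.8 * length(i))))))
test_idx <- setdiff(seq_along(cls), train_idx)
```

Because the sampling is done per class, the edible/poisonous ratio in the training set matches the full data almost exactly, which is why the two class tables above have the same 0/1 proportions.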

Logistic Regression


Accuracy and recall of the logistic regression model on the testing data, for classification thresholds from 0 to 1 in steps of 0.1. (At the degenerate thresholds 0 and 1 every observation is assigned to a single class, the confusion matrix collapses to one column, and both metrics come out NA.)

threshold   accuracy    recall
0.1         0.6377496   0.9880716
0.2         0.7231516   0.9642147
0.3         0.7857528   0.9463221
0.4         0.8015380   0.8774851
0.5         0.7938478   0.7915010
0.6         0.7817053   0.7236581
0.7         0.7601187   0.6381710
0.8         0.7238262   0.5278330
0.9         0.6596060   0.3822068


For a threshold of 0.5, we get the following confusion matrix (rows are actual classes, columns predicted), accuracy, and recall respectively.

##    predicted_log
##        0    1
##   0 2699  689
##   1  839 3185
## [1] 0.7938478
## [1] 0.791501
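To make the two numbers concrete, they can be recomputed directly from the confusion matrix above:

```r
# confusion matrix from the text: rows = actual, columns = predicted
cm <- matrix(c(2699,  689,
                839, 3185), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c(0, 1), predicted = c(0, 1)))

accuracy <- sum(diag(cm)) / sum(cm)   # (TN + TP) / total  ~ 0.7938
recall   <- cm[2, 2] / sum(cm[2, ])   # TP / (TP + FN)     ~ 0.7915
```

Recall here is the fraction of truly poisonous mushrooms the model catches, arguably the metric that matters most for this problem.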


The ROC curve below plots the true positive rate against the false positive rate. An ideal classifier would reach the top-left corner, with full recall and no false positives; at roughly 79% accuracy, the logistic regression model falls well short of that. (Note that a proper ROC curve should be computed from the predicted probabilities rather than from labels already thresholded at 0.5.)
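For intuition, an ROC curve is just the (FPR, TPR) pair traced out as the decision threshold sweeps from 0 to 1. A minimal hand-rolled sketch (toy probabilities and labels, not the mushroom model — the report itself uses the ROCR package):

```r
# sweep thresholds over predicted probabilities p against true 0/1 labels y
roc_points <- function(p, y, thresholds = seq(0, 1, by = 0.05)) {
  t(sapply(thresholds, function(th) {
    pred <- as.integer(p > th)
    c(fpr = sum(pred == 1 & y == 0) / sum(y == 0),   # false positive rate
      tpr = sum(pred == 1 & y == 1) / sum(y == 1))   # true positive rate
  }))
}

# toy example: well-separated probabilities trace a near-ideal curve
y <- c(0, 0, 0, 1, 1, 1)
p <- c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9)
rp <- roc_points(p, y)
```

The closer the traced curve stays above the diagonal (TPR above FPR at every threshold), the better the classifier ranks poisonous above edible.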

Linear Discriminant Analysis


Stacked histograms to visualize the LDA scores:
There is noticeable overlap between the two classes in the histograms, indicating the first linear discriminant does not separate them cleanly. Training accuracy and recall:

## [1] 0.7862274
## [1] 0.7921352


Confusion matrix for LDA on the testing data, followed by test accuracy, recall, and misclassification rate:

##       Predicted
## Actual    0    1
##      0 2643  745
##      1  816 3208
## [1] 0.7893956
## [1] 0.7972167
## [1] 0.2106044

Decision Trees


The decision tree classifier is a straightforward approach for a binary classification problem. In fact, since most of the predictors are categorical, the fitted tree can be interpreted without any prior background in tree methods.
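Under the hood, CART-style trees like rpart's pick each split by how much it reduces node impurity. A minimal sketch of the Gini impurity that rpart uses by default for classification:

```r
# Gini impurity of a node: 1 - sum of squared class proportions.
# A split is scored by the weighted impurity of the child nodes it creates.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

gini(c(0, 0, 1, 1))   # 0.5  (worst case for two classes: 50/50 mix)
gini(c(1, 1, 1, 1))   # 0    (pure node)
```

A split that sends mostly edible mushrooms one way and mostly poisonous ones the other drives both children toward zero impurity, which is exactly what makes the resulting tree easy to read off.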


This is for the default value of cp in the ‘rpart’ library, which is 0.01; the test confusion matrix, accuracy, and recall follow.

##    test_tree
##        0    1
##   0 2970  418
##   1  138 3886
## [1] 0.9249865
## [1] 0.9657058


With cp=0:


Clear signs of overfitting: with no complexity penalty, the tree grows until it nearly memorizes the training data, giving extremely high accuracy. Confusion matrices, accuracy, and recall for the training and testing data respectively:

##               
## train_basetree     0     1
##              0 13498    21
##              1    58 16076
## [1] 0.9973359
## [1] 0.9964051
##    test_basetree
##        0    1
##   0 3378   10
##   1   15 4009
## [1] 0.9966271
## [1] 0.9962724
## [1] 0.003372909


Pruned Tree:

##               
## train_basetree     0     1
##              0 13369   179
##              1   187 15918
## [1] 0.9876572
## [1] 0.9883887
##    test_basetree
##        0    1
##   0 3348   40
##   1   45 3979
## [1] 0.9885321
## [1] 0.9888171
## [1] 0.01146789
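The pruning above uses a fixed cp of 0.001. A common alternative, sketched here on the built-in `iris` data rather than our mushroom set, is to take the cp that minimizes cross-validated error (`xerror`) from rpart's complexity table:

```r
library(rpart)

# grow a deliberately overgrown tree (cp = 0), then read the complexity
# table: each row is a candidate cp with its cross-validated error
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0))
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# prune back to the subtree that minimizes cross-validated error
pruned <- prune(fit, cp = best_cp)
```

Picking cp from `cptable` automates the bias/variance trade-off instead of eyeballing a value; a slightly larger cp than the minimizer (as done in the report) simply buys a smaller, more readable tree.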


Although the pruned tree reaches about 98.9% accuracy on the testing data, there should be models that perform better; it still makes a small number of errors. It can be enhanced by using random forests.

Random Forests

## 
## Call:
##  randomForest(formula = class ~ ., data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.01%
## Confusion matrix:
##       0     1  class.error
## 0 13556     0 0.0000000000
## 1     4 16093 0.0002484935


By default, randomForest grows 500 trees, but anything beyond roughly 200 trees doesn’t seem to reduce the error further.

## [1] 0.9997302
## [1] 0.999503
## 
## Call:
##  randomForest(formula = class ~ ., data = train, mtry = 3, ntree = 200,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.02%
## Confusion matrix:
##       0     1  class.error
## 0 13555     1 7.376807e-05
## 1     5 16092 3.106169e-04
## [1] 0.9997302
## [1] 0.999503


I’m curious to see which attributes matter most in classifying mushrooms.
Variable importance for the initial random forest vs. the tuned random forest:
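The “mean decrease in accuracy” that `varImpPlot` reports is permutation-based: shuffle one predictor's values and measure how much accuracy drops. A self-contained toy sketch of that idea (synthetic data and a simple sign classifier standing in for the forest, so none of these names come from the report):

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)                                # an informative predictor
y  <- as.integer(x1 + 0.3 * rnorm(n) > 0)     # response driven mostly by x1

acc <- function(pred) mean(pred == y)
base_acc <- acc(as.integer(x1 > 0))           # accuracy with x1 intact
perm_acc <- acc(as.integer(sample(x1) > 0))   # accuracy after shuffling x1

importance_x1 <- base_acc - perm_acc          # big drop => important predictor
```

Shuffling destroys the link between x1 and y, so the accuracy drop directly measures how much the model was relying on that predictor; a useless predictor would show a drop near zero.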


The model does well on the training data, with an out-of-bag error rate of just 0.01%. Ideally it would predict the testing data equally well, so let’s check how the model performs there.
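As a quick sanity check on what “out-of-bag” means: each tree is fit on a bootstrap sample of the training rows, which leaves out roughly a third of them, and those left-out rows act as a free test set for that tree. A one-liner confirms the ~36.8% figure:

```r
set.seed(1)
n <- 10000
boot <- sample(n, replace = TRUE)            # one tree's bootstrap sample
oob_frac <- mean(!(seq_len(n) %in% boot))    # rows that tree never saw
oob_frac                                     # close to exp(-1), about 0.368
```

Averaging each row's errors over only the trees that never saw it gives the OOB error rate, which is why it behaves like an honest test estimate without a separate holdout.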


Accuracy and recall on the testing data are given below. The model is 99.97% accurate, which is essentially what we’re looking for.

## [1] 0.9997302
## [1] 0.999503


Running the decision tree algorithm on just the important predictors identified above still fetches decent results.

## 
##     0     1 
## 13556 16097
## 
##    0    1 
## 3388 4024

## [1] 0.8688613
## [1] 0.8565421
## 
## Call:
##  randomForest(formula = class ~ ., data = train1, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.76%
## Confusion matrix:
##       0     1 class.error
## 0 13416   140 0.010327530
## 1    85 16012 0.005280487
## [1] 0.9924447
## [1] 0.9925521

Appendix: R Code

knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)
library(data.table)
#secondary_data <- read.csv("~/Desktop/FIN_DATA_MINING/data mining data/secondary_data.csv", sep=";")
#mushrooms=secondary_data[,c(-8,-9,-13,-14,-16,-19)]
#str(mushrooms)
#setDT(mushrooms)
#head(mushrooms,5)

secondary_data_shuffled <- read.csv("~/Downloads/files/SecondaryData/secondary_data_shuffled.csv", sep=";")

mushrooms1=copy(secondary_data_shuffled)
mushrooms1=mushrooms1[,c(-8,-12,-13,-15,-16,-19)]

# recode empty strings as NA across all retained columns
mushrooms1[mushrooms1 == ""] <- NA


mushrooms1=na.omit(mushrooms1)
mushrooms=copy(mushrooms1)
setDT(mushrooms)
rm(mushrooms1)
library(ggplot2)
library(tidyverse)

setDT(mushrooms)
#str(mushrooms)
mushrooms[,class:=factor(class, levels = c('e','p'),labels = c(0,1))]
mushrooms[,cap.shape:=factor(cap.shape)]
mushrooms[,cap.surface:=factor(cap.surface)]
mushrooms[,cap.color:=factor(cap.color)]
mushrooms[,does.bruise.or.bleed:=factor(does.bruise.or.bleed)]
mushrooms[,gill.attachment:=factor(gill.attachment)]
#mushrooms[,gill.spacing:=factor(gill.spacing)]
#mushrooms[,gill.color:=factor(gill.color)]
#mushrooms[,stem.root:=factor(stem.root)]
#mushrooms[,veil.type:=factor(veil.type)]
mushrooms[,has.ring:=factor(has.ring)]
mushrooms[,ring.type:=factor(ring.type)]
mushrooms[,habitat:=factor(habitat)]
mushrooms[,season:=factor(season)]
mushrooms[,gill.color:=factor(gill.color)]
mushrooms[,stem.color:=factor(stem.color)]
str(mushrooms)
table(mushrooms$class)
ggplot(mushrooms, aes(x = habitat, fill = class)) + geom_bar(position = 'dodge') +
  scale_x_discrete('Habitat', labels = c("woods","grasses","heaths","leaves","meadows","urban","waste")) +
  scale_y_continuous('Count') +   # counts are continuous, not discrete
  scale_fill_manual(breaks = c(0, 1), values = c("green3", "red2"))
ggplot(mushrooms, aes(y = cap.shape, fill = class)) + geom_bar(position = 'dodge') +
  scale_y_discrete('Cap-Shape', labels = c("bell","conical","flat","others","spherical","sunken","convex")) +
  scale_x_continuous('Count') +
  scale_fill_manual(breaks = c(0, 1), values = c("green3", "red2"))
ggplot(mushrooms, aes(x = cap.shape, fill = class)) + geom_bar(position = 'stack') +
  scale_x_discrete('Cap-Shape') + scale_y_continuous('Count') +
  facet_wrap(~has.ring) +
  scale_fill_manual(breaks = c(0, 1), values = c("green3", "red2"))
ggplot(mushrooms, aes(x = season, y = cap.shape, col = class)) + geom_jitter(alpha = 0.5) +
  scale_color_manual(breaks = c(0, 1), values = c("green3", "red2"))
ggplot(mushrooms, aes(x = gill.attachment, y = cap.shape, col = class)) + geom_jitter(alpha = 0.5) +
  scale_color_manual(breaks = c(0, 1), values = c("green3", "red2"))
ggplot(mushrooms, aes(x = ring.type, y = cap.shape, col = class)) + geom_jitter(alpha = 0.5) +
  scale_color_manual(breaks = c(0, 1), values = c("green3", "red2"))
library(caret)
library(ROCR)
set.seed(111)

trainindex=createDataPartition(mushrooms$class,p=0.8,list=FALSE,times=1)
#head(trainindex)
train=mushrooms[trainindex,]
test=mushrooms[-trainindex,]
rm(trainindex)
table(train$class)
table(test$class)
log_model=glm(class~.,data=train,family='binomial')

okok=seq(0,1,0.1)

# metrics from a 2x2 table(actual, predicted), indexed column-major:
# x[1] = TN, x[2] = FN, x[3] = FP, x[4] = TP
accuracy=function(x){
  (x[1]+x[4])/(x[1]+x[2]+x[3]+x[4])
}
recall=function(x){
  x[4]/(x[2]+x[4])
}

predicted_probs=predict(log_model,test,type="response")  # predict once, reuse
for(i in okok){
  predicted_log=ifelse(predicted_probs>i,1,0)
  a=table(test$class,predicted_log)
  print(accuracy(a))  # NA at thresholds 0 and 1: the table has only one column
}

# accuracy values transcribed from an earlier run, used for the plot below
accuracy.list=c(0.5875607,0.6567728,0.7231516,0.7481112,0.7394765,0.7054776,0.6868591,0.6454398,0.5769023)
predicted_probs=predict(log_model,test,type="response")
for(i in okok){
  predicted_log=ifelse(predicted_probs>i,1,0)
  a=table(test$class,predicted_log)
  print(recall(a))
}
# recall values transcribed from the same earlier run
recall.list=c(0.987326,0.9788767,0.9537773,0.889662,0.7691352,0.6336978,0.5072068,0.3727634,0.2266402)
okok=okok[2:10]  # drop the degenerate thresholds 0 and 1
log_df=data.table(okok,accuracy.list,recall.list)

pqr=ggplot(log_df,aes(x=okok,y=accuracy.list))+geom_line(col='blue',size=1)
pqr=pqr+geom_line(aes(x=okok,y=recall.list),size=1)
pqr
predicted_log=predict(log_model,test,type="response")
predicted_log=ifelse(predicted_log>0.5,1,0)
a=table(test$class,predicted_log)
a
accuracy(a)
recall(a)
# build the ROC from predicted probabilities, not thresholded labels
log_probs <- predict(log_model, test, type = "response")
pr_admission <- prediction(log_probs, test$class)
prf_admission <- performance(pr_admission, measure = "tpr", x.measure = "fpr")
plot(prf_admission, colorize = TRUE, lwd=3)
library(MASS)
linear=lda(class~.,train)

p=predict(linear,train)

ldahist(data=p$x[,1],g=train$class)

train_predict=predict(linear,train)
gg=table(Actual=train$class,Predicted=train_predict$class)
accuracy(gg)
recall(gg)
test_pred=predict(linear,test)
bb=table(Actual=test$class,Predicted=test_pred$class)
bb

accuracy(bb)
recall(bb)
# misclassification rate: (FN + FP) / total
misclas=function(x){
  (x[2]+x[3])/(x[1]+x[2]+x[3]+x[4])
}
misclas(bb)
#caret::confusionMatrix(data = predict(tree_1, newdata = test, type = "class"), 
 #                      reference = test$class, 
 #                      positive = "edible")
library(rpart)
library(rpart.plot)

set.seed(789)
tree_1=rpart(class~.,data=train,method="class")
rpart.plot(tree_1)
test_tree=predict(tree_1,newdata = test,type="class")
b=table(test$class,test_tree)
b

accuracy(b)
recall(b)
#train_tree=predict(tree_1,newdata = train,type="class")
#table(train_tree,train$class)
#table(test_tree)
#levels(train$class) <- c("edible", "poisonous")
#levels(test$class) <- c("edible", "poisonous")

set.seed(143)
tree_base <- rpart(class~., data = train, method = "class",control = rpart.control(cp = 0))
#summary(tree_base)
#summary(tree_1)
rpart.plot(tree_base)
plotcp(tree_base,col='red',lwd=1)

train_basetree=predict(tree_base,newdata = train,type="class")
c=table(train_basetree,train$class)  # predicted in rows here, so recall() on this table is actually precision
c

accuracy(c)
recall(c)

test_basetree=predict(tree_base,newdata = test,type="class")
b=table(test$class,test_basetree)
b

accuracy(b)
recall(b)
misclas(b)

# cp minimizing cross-validated error; prune with a slightly larger cp (0.001)
# below for a simpler, more readable tree
bestcp <- tree_base$cptable[which.min(tree_base$cptable[, "xerror"]), "CP"]
tree_pruned=prune(tree_base,cp=0.001)
rpart.plot(tree_pruned)

train_basetree=predict(tree_pruned,newdata = train,type="class")
d=table(train_basetree,train$class)
d

accuracy(d)
recall(d)

test_basetree=predict(tree_pruned,newdata = test,type="class")
e=table(test$class,test_basetree)
e

accuracy(e)
recall(e)


#rpart.plot(tree_1)
#tree_1=rpart(class~.,data=train,method="class")
misclas(e)
set.seed(899)
library(randomForest)
mush_rf=randomForest(class~.,data=train,importance=TRUE)
mush_rf
plot(mush_rf)
rf_base_pred=predict(mush_rf,newdata = test)
base=table(test$class,rf_base_pred)
accuracy(base)
recall(base)
set.seed(8899)
tree_final=randomForest(class~.,mtry=3,ntree=200,data=train,importance=TRUE)
tree_final

rf_pred=predict(tree_final,newdata = test)

xyz=table(test$class,rf_pred)
accuracy(xyz)
recall(xyz)
varImpPlot(mush_rf, sort = TRUE, 
           n.var = 10, main = "Importance of predictors")

varImpPlot(tree_final, sort = TRUE, 
           n.var = 10, main = "Importance of predictors")
rf_res=predict(tree_final,newdata=test)
pp=table(test$class,rf_res)

accuracy(pp)
recall(pp)
mush_new=mushrooms[,c(1,4,5,7,9,10)]
set.seed(3322)

trainindex=createDataPartition(mushrooms$class,p=0.8,list=FALSE,times=1)
#head(trainindex)
train1=mush_new[trainindex,]
test1=mush_new[-trainindex,]
rm(trainindex)
table(train1$class)
table(test1$class)

set.seed(555555)
tree_random=rpart(class~.,data=train1,method="class")
rpart.plot(tree_random)
test_tree=predict(tree_random,newdata = test1,type="class")
g=table(test_tree,test1$class)


accuracy(g)
recall(g)

set.seed(789789)
new_rf=randomForest(class~.,data=train1,importance=TRUE)
new_rf
test_tree=predict(new_rf,newdata = test1,type="class")
g=table(test_tree,test1$class)

accuracy(g)
recall(g)